Indexing Schemes for Similarity Search: an Illustrated Paradigm

نویسندگان

  • Vladimir Pestov
  • Aleksandar Stojmirovic
چکیده

We suggest a variation of the Hellerstein— Koutsoupias—Papadimitriou indexability model for datasets equipped with a similarity measure, with the aim of better understanding the structure of indexing schemes for similarity-based search and the geometry of similarity workloads. This in particular provides a unified approach to a great variety of schemes used to index into metric spaces and facilitates their transfer to more general similarity measures such as quasi-metrics. We discuss links between performance of indexing schemes and high-dimensional geometry. The concepts and results are illustrated on a very large concrete dataset of peptide fragments equipped with a quasi-metric tree indexing scheme.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Resource Description and Selection for Range Query Processing in General Metric Spaces

Similarity search in general metric spaces is a key aspect in many application fields. Metric space indexing provides a flexible indexing paradigm and is solely based on the use of a distance metric. No assumption is made about the representation of the database objects. Nowadays, ever-increasing data volumes require large-scale distributed retrieval architectures. Here, local and global indexi...

متن کامل

A Short Survey on P2P Data Indexing

P2P data indexing has recently attracted a great many research efforts. For various proposed schemes, there are generally two taxonomies: 1) From a systematic point of view, existing schemes fall into two categories: the over-DHT indexing paradigm, which as a layered manner, indexes data in DHT key space (i.e., over DHT), and the overlay-dependent indexing paradigm, which indexes data directly ...

متن کامل

Indexing Schemes for Similarity Search In Datasets of Short Protein Fragments

We propose a family of very efficient hierarchical indexing schemes for ungapped, score matrixbased similarity search in large datasets of short (4-12 amino acid) protein fragments. This type of similarity search has importance in both providing a building block to more complex algorithms and for possible use in direct biological investigations, and datasets can have on the order of 60 million ...

متن کامل

Fuzzy retrieval of encrypted data by multi-purpose data-structures

The growing amount of information that has arisen from emerging technologies has caused organizations to face challenges in maintaining and managing their information. Expanding hardware, human resources, outsourcing data management, and maintenance an external organization in the form of cloud storage services, are two common approaches to overcome these challenges; The first approach costs of...

متن کامل

Curse of Dimensionality in the Application of Pivot-based Indexes to the Similarity Search Problem

In this work we study the validity of the so-called curse of dimensionality for indexing of databases for similarity search. We perform an asymptotic analysis, with a test model based on a sequence of metric spaces (Ωd) from which we pick datasets Xd in an i.i.d. fashion. We call the subscript d the dimension of the space Ωd (e.g. for R d the dimension is just the usual one) and we allow the si...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Fundam. Inform.

دوره 70  شماره 

صفحات  -

تاریخ انتشار 2006